A common critique of modern songs is that they are overly simple, in both textual and conceptual content, and far too repetitive. This critique is most often advanced alongside claims about the changing mentalities of newer generations. To test it, this investigation explores several aspects of lyrical complexity in the lyrics database scraped from MetroLyrics. We use not only the stem words but also the raw lyrics themselves, after removing bracketed text, abbreviations, punctuation, extra whitespace, and so on. This is necessary because most of the repetition in a song shows up in the raw lyrics rather than in the already pruned stem words.
The analysis looks at trends in lyric complexity over time and across genre, artist popularity, and more.
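The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the function name `clean_lyrics` is hypothetical, only square-bracketed annotations are treated as "brackets", and abbreviation handling is left out because the original rules are not stated.

```python
import re
import string


def clean_lyrics(raw: str) -> list[str]:
    """Return lowercase word tokens from raw lyrics (illustrative only)."""
    # Assumption: "brackets" means square-bracketed tags such as "[Chorus]".
    text = re.sub(r"\[[^\]]*\]", " ", raw)
    # Strip ASCII punctuation; typographic quotes would need extra handling.
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # split() with no arguments collapses every run of whitespace.
    return text.split()
```

Note that stripping punctuation turns "It's" into "its", which is usually acceptable when the goal is counting repeated words.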
Our first metric for lyrical complexity is as follows. For each song, the number of unique words in the cleaned lyrics is counted. Then, to standardize for differences in song length, the number of unique words is divided by the total number of words in the song to arrive at a percent unique lyrics value. This process is repeated for stem words. The dataset is then subset by time and genre, and the average percent unique lyrics (as well as stem words) is taken across these factors. Due to imbalance in the dataset, the results are confined to years past which each genre has over 100 songs recorded. Folk and R&B are excluded from this time series analysis due to lack of data.
As both graphs below show, the percentage of unique words and of unique stem words has declined for most genres observed, although Jazz has remained relatively stable.
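As a rough sketch of this metric, assuming each song is already available as a list of cleaned word tokens (the names `percent_unique` and `mean_percent_unique` and the `(year, genre, tokens)` record layout are made up for illustration):

```python
from collections import defaultdict
from statistics import mean


def percent_unique(tokens):
    """Unique words as a percentage of all words in one song."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def mean_percent_unique(songs):
    """Average percent unique per (year, genre).

    songs: iterable of (year, genre, tokens) tuples (assumed layout).
    """
    groups = defaultdict(list)
    for year, genre, tokens in songs:
        groups[(year, genre)].append(percent_unique(tokens))
    return {key: mean(values) for key, values in groups.items()}
```

The same functions apply unchanged to stem words: pass the stemmed token list instead of the cleaned lyrics.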
Note: Double click on a genre in the legend to isolate its graph.
We can use the slope graph below to take a closer look at how far each genre drops from the start of its data to the present day. The difference in means between the starting point and the present is statistically significant at the 95% confidence level for all genres except Jazz.
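The text does not say which test produced this result. One plausible way to run such a comparison is Welch's unequal-variance statistic with a normal approximation for the p-value, which is reasonable here only because each group has more than 100 songs. The function name `welch_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_z_test(a, b):
    """Two-sided test for a difference in means with unequal variances.

    Assumption: samples are large enough that the Welch statistic is
    approximately standard normal, so no t-distribution is needed.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    stat = (mean(a) - mean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value
```

Here `a` would hold the percent unique values for a genre's first year and `b` those for its most recent year; a p-value below 0.05 corresponds to significance at the 95% level.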
From here, the data is subset by both genre and artist, and various aggregate statistics are compiled. Notably, the artists within each genre who are in the 99th percentile for number of songs in the database are marked as ‘top artists’. An artist's number of songs in this database is thus used as a proxy for popularity. We then compared the percent of unique words typically used in songs by top artists in a genre against the rest. Although it was hypothesized that more popular artists would have simpler lyrics, there was ultimately no significant difference across groups, save for Folk and Metal.
Another metric assessed was average word length (in characters) for a given song. This was meant to be a metric for lyrical complexity, with longer average word lengths suggesting higher-level text. However, there was almost no difference in average word length across genres.
Lastly, average word and stem word count were tracked across time and genre. Interestingly, most genres remained constant over time except for Hip-Hop, which lost roughly 40 words on average per song. As Hip-Hop has continuously grown more mainstream over the past decade, this decline perhaps makes sense as the genre becomes more accessible to the masses. Notably, the average word count in a Hip-Hop song is still well above that of other genres.
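A minimal sketch of the ‘top artists’ flag for a single genre follows. The function name `top_artists` is hypothetical, and two details are assumptions: the cutoff is computed with the inclusive quantile method, and an artist exactly at the cutoff counts as a top artist.

```python
from collections import Counter
from statistics import quantiles


def top_artists(artist_per_song, pct=99):
    """Artists whose song count is at or above the pct-th percentile.

    artist_per_song: one artist name per song, all from the same genre.
    Needs at least two distinct artists to compute a percentile.
    """
    counts = Counter(artist_per_song)
    # quantiles(..., n=100) returns 99 cut points; index pct - 1 is the pct-th.
    cutoff = quantiles(counts.values(), n=100, method="inclusive")[pct - 1]
    return {artist for artist, n in counts.items() if n >= cutoff}
```

With heavily tied counts (for example, most artists having one song) the cutoff can equal the most common count, so a strict `>` comparison may be the better choice in practice.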
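For reference, the per-song average word length amounts to the following, assuming the song is given as a list of cleaned tokens (`average_word_length` is an illustrative name):

```python
def average_word_length(tokens):
    """Mean number of characters per word in one song."""
    if not tokens:
        return 0.0
    return sum(len(word) for word in tokens) / len(tokens)
```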
Since Hip-Hop was one of the genres that most declined in complexity (in unique word and stem word percentages as well as average word counts), let us take a further look. We previously posited that the genre’s rising mainstream popularity may be behind the changes, but the content may offer insight as well. Below is a word cloud of the most popular words in Hip-Hop in the year 2002, and below that is the same figure for 2016.
We can see a marked increase in profanity and words with violent connotations going from 2002 to 2016. Notably, the once prominent “girl” is replaced by the corresponding derogatory term. There are also far more references to money and luxury items now than there were in the past. Hip-Hop’s previously discerned shift in lyrical and textual complexity over the past two decades might then be driven by this corresponding change in lyrical content.
In general, the analysis has shown a discernible trend towards less complex song lyrics. A limitation of this analysis is that the time series analysis relied primarily on lyrics from songs released after 2000. A larger data set would be helpful for more conclusive claims. Additionally, the percent unique words metric could be improved, perhaps with an algorithmic approach that finds repeated phrases rather than repeated words.
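One simple way such a phrase-level measure might look is the share of word n-grams in a song that occur more than once. This is only a sketch of the suggested future direction, not something the analysis above computed; the name `repeated_phrase_share` and the default phrase length of three words are arbitrary choices.

```python
from collections import Counter


def repeated_phrase_share(tokens, n=3):
    """Fraction of n-word phrases in a song that appear more than once."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of any phrase that shows up at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A value near 1.0 indicates a song built almost entirely from repeated phrases, while 0.0 means no phrase of that length recurs.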